Comparing the variances of several treatments with that of a control treatment: Theory and applications

A common and important problem in medicine, economics and environmental studies is the comparison of the variances of several treatments with that of a control treatment. Among the existing methods, Spurrier's optimal test, based on the multivariate F distribution, has exact type I error rates. However, it requires equal sample sizes among the treatment groups. To extend the scope of application, in this paper we propose a new efficient test for comparing several variances with a control using the marginal inferential model (MIM). Simulation studies show that the MIM test attains the exact type I error rate whether the sample sizes are equal or unequal. Moreover, the power of the MIM test is competitive with that of Spurrier's optimal test. Finally, two real examples are used to demonstrate the application of the proposed method.


Introduction
In medicine, economics and environmental studies, there are many situations wherein k independent populations are compared to a control population with respect to scale parameters [1]. For example, in medical studies, the variability of testosterone levels in men in various groups classified according to their smoking habits is compared to the variability of testosterone levels in healthy people. [2] indicated that smoking has a negative impact on testosterone levels, leading to less variability, as some of them may have low testosterone levels for other reasons, while healthy persons will have high levels. In this situation, a common problem is whether different types of smokers, e.g., former smokers, light smokers or heavy smokers, have less testosterone variability than nonsmokers. Thus, in the case of k treatment populations $\pi_1, \pi_2, \ldots, \pi_k$ and a control population $\pi_0$, where the observations from the ith population follow the normal distribution $N(\mu_i, \sigma_i^2)$, $i = 0, 1, \ldots, k$, the interest is to test the hypothesis

$H_0: \sigma_i^2 = \sigma_0^2, \ \forall i$ vs $H_1: \sigma_i^2 \le \sigma_0^2, \ i = 1, \ldots, k$, with at least one strict inequality, (1)

where $\sigma_i^2$, $i = 0, 1, \ldots, k$, is the variance of the ith population.

Many studies focus on multiple comparisons of treatments with a control or standard with respect to a parameter of interest under order restrictions. [3,4] proposed a standard multiple comparison procedure for comparing several treatments with a control. [5] constructed confidence intervals for comparing several normal variances with a control variance in multifactor experiments. [6] provided an algorithm for constructing multiple hypothesis tests. To improve the mean half-square successive difference statistic, [7,8] proposed a modified percentile bootstrap method for comparing the variances of two independent groups. Moreover, a combination of Levene-type tests with a finite-intersection method for testing the equality of variances against ordered alternatives can be found in [9]. [10] discussed the quality of F-ratio resampling tests for comparing variances. Note that all the test methods mentioned above for comparing variances have been found to be unsatisfactory in terms of type I error probabilities or powers.
To obtain an optimal test procedure with exact control of the type I error rate, [11,12] recommended the use of sample quasi ranges as a measure of variance. The distribution of quasi ranges in samples from a normal population was discussed by [13]. [14] classified normal populations with respect to a control using sample quasi ranges on censored data. [15] discussed optimal designs for comparing several experimental treatment variances with that of a standard treatment variance. A one-sided test based on the sample quasi range was proposed by [16] to test homogeneity against a simple ordered alternative. [17] proposed a test based on isotonic estimators for testing the equality of variances of several normal populations against tree-ordered alternatives. Moreover, [1] proposed an upper one-sided test based on the sample quasi range to test the homogeneity of variances of normal populations with that of the control population. By computing exact critical constants, the sample quasi range method can control the type I error rate at a preassigned level α.
The sample quasi range approach aims at provably efficient inference, and the corresponding test can guarantee the exact type I error rate. Moreover, unlike other test methods, Spurrier's optimal test [15] is a single-step test procedure, while the test methods based on sample quasi ranges can be regarded as step-up test procedures for multiple hypothesis testing. In addition, some studies consider testing whether the variances of k populations are not equal to the variance of a control population. However, under certain circumstances, one-sided simultaneous confidence intervals provide more inferential sensitivity than two-sided simultaneous confidence intervals [18]. For example, some upper one-sided tests for comparing several normal variances with a control variance can be found in [1,15,17-19].
For lower one-sided tests, [15] provided the optimal test procedure for hypothesis test (1). More specifically, suppose that $n_0 = n$ and $n_1 = n_2 = \cdots = n_k = m$, and denote the sample variance of the ith group by $S_i^2$, $i = 0, 1, 2, \ldots, k$. Define the test statistics $F_i = S_i^2 / S_0^2$, $i = 1, 2, \ldots, k$. The distribution of $(F_1, \ldots, F_k)$ is a multivariate F distribution, and the marginal distribution of $F_i$ is the F distribution with $m - 1$ and $n - 1$ degrees of freedom, $i = 1, 2, \ldots, k$. Letting $F_{(1)} = \min(F_1, \ldots, F_k)$, the p-value of the Spurrier test is given by (4), the null probability that $F_{(1)}$ does not exceed $c$, where $c = \min\{s_1^2/s_0^2, s_2^2/s_0^2, \ldots, s_k^2/s_0^2\}$ is the ratio of the smallest observed sample variance among the k treatments to the observed sample variance of the control, H is the cdf of $\chi^2(m - 1)$ and g is the pdf of $\chi^2(n - 1)$. In general, (4) can be well approximated by Gauss-Laguerre numerical quadrature, using subroutines to evaluate H. Spurrier's test [15] has greater applicability while ensuring competitive efficiency, but it requires equal sample sizes among the treatment groups. [19] indicated that one may design an experiment that meets the sample size requirement, but the final available data might be unequal as a result of unexpected losses. Hence, the goal of this paper is to construct a more efficient test for general cases involving equal and unequal sample sizes, based on inferential model theory [20].
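The p-value described above can be sketched numerically. The snippet below is a minimal illustration, not Spurrier's published routine: it assumes the test statistics are the ratios $S_i^2/S_0^2$, conditions on the control chi-square variable, and uses adaptive quadrature from SciPy in place of the Gauss-Laguerre rule mentioned in the text. The function name `spurrier_pvalue` and the input values in the usage line are hypothetical.

```python
import numpy as np
from scipy import integrate, stats


def spurrier_pvalue(s2_control, s2_treatments, n, m):
    """Lower one-sided p-value P(F_(1) <= c) under H0.

    n is the control sample size; m is the common treatment sample size.
    """
    k = len(s2_treatments)
    c = min(s2_treatments) / s2_control  # observed minimum variance ratio
    scale = c * (m - 1) / (n - 1)

    def integrand(y):
        # Given the control chi-square value y, the k treatment ratios are
        # independent; each exceeds c with probability 1 - H(scale * y).
        return stats.chi2.sf(scale * y, m - 1) ** k * stats.chi2.pdf(y, n - 1)

    prob_all_exceed, _ = integrate.quad(integrand, 0.0, np.inf)
    return 1.0 - prob_all_exceed


# Hypothetical inputs: control variance 2.0, three treatments, 12 units per group.
print(spurrier_pvalue(s2_control=2.0, s2_treatments=[1.5, 0.9, 0.4], n=12, m=12))
```

With a single treatment group (k = 1) the integral reduces to the ordinary F distribution function with m - 1 and n - 1 degrees of freedom, which gives a quick sanity check of the sketch.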

Marginal inferential model framework
In contrast to frequentist and Bayesian inference methods, Fisher and Dempster intended to propose prior-free inference frameworks that produce probabilistic inferential results with desirable frequency properties. However, the small-sample properties of the fiducial argument [21,22] and Dempster-Shafer theory [23,24] may not be calibrated for meaningful probabilistic inference. As an alternative, [20] proposed an inferential model (IM) framework for prior-free probabilistic inference. In fact, the IM has some connections to fiducial inference and Dempster-Shafer theory. The key difference between these three methods is the way in which the auxiliary variables are handled. In particular, the IM's handling of the auxiliary variables can guarantee desirable frequency properties for all sample sizes.
In marginal inference problems, where only part of the full parameter is of interest, [25] developed a marginal inferential model (MIM) framework. The key idea of the MIM is to reduce the dimension of the auxiliary variable. In general, the MIM starts with a system of equations, called an association, representing a statistical model with unknown parameter θ = (ψ, ξ) for observable data $X \sim P_{X|\theta}$ via an auxiliary random variable U. The initial model is expressed as if the goal were to simulate the data, i.e., through an association in which p and a are known functions and U has a known distribution function. To emphasize that θ = (ψ, ξ), the association can be rewritten with ψ and ξ as separate arguments, giving (6). Note that the IM consists of a three-step inference procedure: an association step (A-Step), a prediction step (P-Step), and a combination step (C-Step). Since ψ is the parameter of interest and ξ is the nuisance parameter, the marginal IM has the following three steps.

A-Step: Suppose that there are functions q, b and c, and new variables $V = (V_1, V_2) \sim P_V$, such that (6) can equivalently be written as two components, (7a) and (7b). Since the exact value of $V_2$ does not provide any information about the interest parameter ψ, there is no benefit to retaining component (7b) and trying to predict the auxiliary variable $V_2$. Clearly, the key idea of the MIM is that $V_1$ is generally of a lower dimension than U.
Therefore, the MIM for ψ is based only on the association that involves $V_1$. This dimension-reduction step of the MIM guarantees efficient inference properties. The result of the A-Step is a set-valued mapping that collects, for given data and a given value of $V_1$, the values of ψ satisfying this association.

P-Step: Following [7], the unobserved value $V_1^*$ of the auxiliary variable is predicted by specifying an optimal predictive random set $S(V_1)$. A good predictive random set is one for which there is a high probability that $V_1^* \in S(V_1)$.

C-Step: Combine the association and the predictive random set $S(V_1)$ to obtain the final predictive random subset of ψ.

According to the inference framework of the IM, given an assertion of interest A, the MIM also provides two probabilistic measures of the uncertainty about A: the belief function $\mathrm{bel}_X(A)$ and the plausibility function $\mathrm{pl}_X(A)$. In fact, $\mathrm{bel}_X(A)$ and $\mathrm{pl}_X(A)$ can be regarded as the minimum and maximum probabilities that support the truth of assertion A. Note that the plausibility function can be easily used to create a frequentist decision rule: we reject the null hypothesis if $\mathrm{pl}_X(A) \le \alpha$. Moreover, the MIM $100(1 - \alpha)\%$ confidence interval can be obtained by computing confidence limits from $\{\psi: \mathrm{pl}_X(\psi) > \alpha\}$.
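For reference, the belief and plausibility functions are commonly written as follows in the IM literature. This is a sketch under assumed notation rather than a reproduction of this paper's displayed equations: $\Psi_X(S)$ stands for the predictive random subset of ψ produced by the C-Step, $P_S$ for the distribution of the predictive random set, and $\Psi_X(S)$ is assumed to be nonempty with probability one.

```latex
\mathrm{bel}_X(A) = P_S\{\Psi_X(S) \subseteq A\}, \qquad
\mathrm{pl}_X(A) = 1 - \mathrm{bel}_X(A^c) = P_S\{\Psi_X(S) \cap A \neq \emptyset\}.
```

Under these definitions $\mathrm{bel}_X(A) \le \mathrm{pl}_X(A)$ always holds, which is consistent with reading the two quantities as lower and upper probabilities for A.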
The IM-based method is exact in the sense that it does not need any asymptotic approximation. Moreover, the IM's output has a meaningful interpretation within and not just across experiments. According to [20], the IM test is valid if $\sup_{\theta \in A} P_{X|\theta}\{\mathrm{pl}_X(A) \le \alpha\} \le \alpha$ for every $\alpha \in (0, 1)$. Moreover, if "$\le \alpha$" can be replaced by "$= \alpha$", then the MIM method is efficient.

The proposed MIM test
In this section, we propose a marginal IM-based method for testing hypothesis (1). Suppose that the observations $X_{i1}, X_{i2}, \ldots, X_{in_i}$ from the ith population $\pi_i$ follow the normal distribution $N(\mu_i, \sigma_i^2)$. The null hypothesis in (1) can then be translated into the assertion $B = \{\sigma_i^2 = \sigma_0^2, \ \forall i\}$, where $i = 0, 1, \ldots, k$, $n_i \in \{1, 2, 3, \ldots, l\}$, and k and l are positive integers. If evidence from the observable data suggests that the assertion B is false, the null hypothesis is rejected. Let $\bar X_i$ and $S_i^2$ denote the sample mean and sample variance of the ith sample. According to [26], the conditional association model based on the minimal sufficient statistics for $(\mu_i, \sigma_i^2)$ is expressed in terms of auxiliary variables $U_i$ and $V_i$, and it can be equivalently simplified to the form (16). Here $\sigma_i^2$ are the parameters of interest. For any $\bar X_i$, $S_i$, $\sigma_i^2$ and $U_i/V_i$, there exists a $\mu_i$ such that the first equation in (16) holds. Since no direct information about $\sigma_i^2$ can be obtained from knowing $U_i/V_i$, there is no benefit to retaining the first equation in (16). To reduce the dimension of the auxiliary variable, we can ignore $U_i$ ($i = 0, 1, \ldots, k$) and work with the auxiliary variables $V_i$ ($i = 0, 1, \ldots, k$) directly. Therefore, the initial association model of the marginal IM can be expressed as (17), which links each $S_i^2$ to $\sigma_i^2$ through $V_i^2$. Note that the associations (17) play two distinct roles. Before the experiment, they characterize the sampling distribution of the observable $S_i^2$, $i = 1, \ldots, k$. Once $S_i^2$, $i = 1, \ldots, k$, are observed, the true parameters $\sigma_i^2$ could be obtained by solving the above equations if the values of $V_i^2$ were known. Clearly, the true values of $V_i^2$, $i = 1, \ldots, k$, will never be known, but we know their distribution exactly. Therefore, we can predict $V_i^2$ and thereby make inference about $\sigma_i^2$, $i = 1, \ldots, k$.

However, we encounter the following difficulties. On the one hand, since the unknown parameter is (k + 1)-dimensional, the auxiliary random variable should also be (k + 1)-dimensional, which might lead to poor efficiency, especially when k is large. On the other hand, the hypothesis actually has constraints, i.e.,

$\sigma_i^2 \le \sigma_0^2, \ i = 1, \ldots, k.$ (18)

Since the parameters are constrained, it is challenging to make inferences about the assertion B, and some additional techniques or strategies are needed.
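The associations discussed in this section can be sketched as follows. This is a hedged reconstruction based on the standard IM treatment of the normal model, assuming $U_i \sim N(0, 1)$ and $(n_i - 1) V_i^2 \sim \chi^2(n_i - 1)$ independently; the exact displayed forms and equation numbers of the paper may differ.

```latex
% Baseline association for group i (assumed standard form)
\bar X_i = \mu_i + \sigma_i n_i^{-1/2} U_i, \qquad S_i = \sigma_i V_i, \\
% Equivalent form in which \mu_i appears only in the first equation
\bar X_i = \mu_i + n_i^{-1/2} S_i \, U_i / V_i, \qquad S_i = \sigma_i V_i, \\
% Marginal association after dropping the equation that involves \mu_i
S_i^2 = \sigma_i^2 V_i^2, \qquad i = 0, 1, \ldots, k.
```

In this form the first equation can always be solved for $\mu_i$, whatever the values of the other quantities, which is why it carries no information about $\sigma_i^2$ and can be dropped.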
First, we take a different perspective on (18): it is regarded as the parameter space rather than as a constraint. This is reasonable in view of the null and alternative hypotheses in (1). If we take (18) as the parameter space, a straightforward transformation of (1) introduces a one-dimensional parameter θ, in terms of which the hypothesis can be restated as (21). We can see that (21) and (1) are equivalent under (18). Therefore, we only need to make inference about the assertion B = {θ = 1}. Note that assertion B involves only a one-dimensional parameter.
Next, we rewrite association (17). Combining equations (22) and (24), (20) can be written as (25). According to [27], the distribution of the auxiliary variable on the right side of (25) does not depend on the observed data. Let F denote its distribution function. The final association (26) is then expressed through $F^{-1}(u)$, where $u \sim \mathrm{Unif}(0, 1)$ and $F^{-1}$ is the inverse function of F.

P-Step: Different kinds of assertions call for different forms of valid predictive random sets. For the left-sided assertion (21), a possible optimal predictive random set S(u) for predicting the auxiliary variable u is given in the following theorem.

Theorem 1. According to [20], for a left-sided assertion, the predictive random set $S(u) = \{u: 0 \le u \le U\}$, $U \sim \mathrm{Unif}(0, 1)$, is optimal in the sense that $P_{S(u)}\{u \in S(u)\} \sim \mathrm{Unif}(0, 1)$.

Proof.
The predictive random set S(u) is optimal in the sense that $P_U\{Q_{S(u)}(U) \ge 1 - \alpha\} = \alpha$ for each $\alpha \in (0, 1)$. Hence, the proof is complete.

C-Step: Combining (26) and (27), we obtain the final predictive random set for θ, from which the plausibility function for assertion B follows.

Theorem 2. The proposed MIM inference method can control the type I error rate at a preset level $\alpha \in (0, 1)$. Hence, the proof is complete.
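A plausibility of this kind can be approximated by simulating the auxiliary variables, whose distribution is known. The sketch below rests on two assumptions that the text does not confirm: that θ is the smallest treatment-to-control variance ratio, and that the plausibility of B = {θ = 1} equals the null distribution function of the observed minimum ratio of sample variances. The function name `mim_plausibility`, the Monte Carlo settings and the input values are hypothetical, and the sample sizes may be unequal.

```python
import numpy as np


def mim_plausibility(s2, n, draws=200_000, seed=0):
    """Monte Carlo plausibility of B = {theta = 1}.

    s2[0] and n[0] belong to the control group; the remaining entries
    belong to the k treatment groups.
    """
    s2 = np.asarray(s2, dtype=float)
    df = np.asarray(n, dtype=float) - 1.0
    t_obs = np.min(s2[1:] / s2[0])  # observed minimum variance ratio

    rng = np.random.default_rng(seed)
    # Simulate V_i^2 = chi2(n_i - 1) / (n_i - 1), one column per group.
    v2 = rng.chisquare(df, size=(draws, df.size)) / df
    # Under theta = 1 the minimum ratio is distributed as min_i V_i^2 / V_0^2.
    t_null = np.min(v2[:, 1:] / v2[:, [0]], axis=1)
    return float(np.mean(t_null <= t_obs))


# Hypothetical data: control plus three treatments with unequal sample sizes.
print(mim_plausibility(s2=[1.2, 0.9, 0.5, 0.3], n=[5, 6, 7, 8]))
```

Because only the degrees of freedom enter the simulation, the same code covers equal and unequal group sizes; a small returned value would lead to rejecting the null hypothesis at level α when it does not exceed α.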

Simulation study
The proposed marginal IM method is an exact testing method, and the efficiency of MIM inference does not require simulation verification. However, the p-value of Spurrier's optimal test needs to be approximated using Gauss-Laguerre numerical quadrature; in this sense, it is an approximate method. Moreover, due to large-sample asymptotic properties, the performance of the two test methods tends to be consistent as the sample sizes grow. For a better comparison, we conduct Monte Carlo simulations to assess the performance of the MIM-based test and Spurrier's test in various small-sample situations. The quantities compared are the type I error rate and the power. The parameter settings follow [20] and [10]. In the experiment, we consider various cases with different sample sizes $(n_0, n_1, \ldots, n_k)$ and different variance configurations $(\sigma_0^2, \sigma_1^2, \ldots, \sigma_k^2)$ for k = 3, 5, and 7. In each case, we repeat the experiment 10,000 times at the 5% significance level; with this number of replications, an estimated type I error rate within (0.0457, 0.0543) is consistent with the nominal 5% level at 95% confidence. Tables 1 to 3 summarize the type I error rates and empirical powers of the two methods for small sample sizes. Note that the first component of $(n_0, n_1, \ldots, n_k)$ is for the control group; for example, $(n_0, n_1, n_2, n_3) = (5, 6, 7, 8)$ means that the control group has five observations.
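The simulation design above can be illustrated with a short script. It is a sketch under stated assumptions, not the authors' simulation code: the test is taken to reject when the minimum ratio of treatment to control sample variances falls below an exact critical value, all population variances are set to 1 under the null hypothesis, and the helper names `null_cdf` and `type1_error` are hypothetical.

```python
import numpy as np
from scipy import integrate, optimize, stats


def null_cdf(t, n):
    """P(min_i S_i^2 / S_0^2 <= t) under H0; n[0] is the control sample size."""
    df = np.asarray(n, dtype=float) - 1.0

    def integrand(y):
        # y plays the role of the control chi-square variable.
        tail = stats.chi2.sf(t * df[1:] * y / df[0], df[1:])
        return np.prod(tail) * stats.chi2.pdf(y, df[0])

    return 1.0 - integrate.quad(integrand, 0.0, np.inf)[0]


def type1_error(n, alpha=0.05, reps=10_000, seed=1):
    """Empirical rejection rate of 'reject if statistic <= critical value'."""
    crit = optimize.brentq(lambda t: null_cdf(t, n) - alpha, 1e-8, 1.0)
    rng = np.random.default_rng(seed)
    df = np.asarray(n, dtype=float) - 1.0
    # Sample variances under H0 with every population variance equal to 1.
    s2 = rng.chisquare(df, size=(reps, df.size)) / df
    stat = np.min(s2[:, 1:] / s2[:, [0]], axis=1)
    return float(np.mean(stat <= crit))


# 95% band for an estimated rate when the true rate is 5% and reps = 10,000.
half_width = stats.norm.ppf(0.975) * np.sqrt(0.05 * 0.95 / 10_000)
print(round(0.05 - half_width, 4), round(0.05 + half_width, 4))  # 0.0457 0.0543

print(type1_error((5, 6, 7, 8)))
```

The first printed line reproduces the interval (0.0457, 0.0543) quoted in the text; the second is an estimated rejection rate for one unequal-sample-size configuration and will vary with the seed.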
When the sample sizes $(n_0, n_1, \ldots, n_k)$ are equal, it is easy to see that both the MIM test and the Spurrier test control the type I error rate exactly. Moreover, the powers and the type I error rates of the MIM test and the Spurrier test are almost the same. One possible reason is that both tests utilize the same information from the given data. Specifically, they use the minimum sample variance among the treatment groups and the sample variance within the control group, and (29) is connected to the order statistics of the multivariate F distribution. However, Spurrier's test does not work when the sample sizes of the treatment groups are unequal. In these cases, the proposed MIM test still attains the preset level α, with estimated type I error rates inside (0.0457, 0.0543). Therefore, the MIM test outperforms Spurrier's test because it has more flexible applicability while maintaining competitive efficiency.

Applications
In this section, we use two real examples to illustrate the proposed MIM test.
Example 1. [2] considered four groups of men in the 35-45 years age bracket: (i) nonsmokers (control), (ii) former smokers (treatment), (iii) light smokers (treatment), and (iv) heavy smokers (treatment). Each group consisted of ten men, and Table 4 shows the testosterone levels measured in μg/dl. It is known that smoking has a negative impact on testosterone levels, leading to less variability, as some of them may have low testosterone levels for other reasons, while healthy people will have high levels. The question is then whether any of the three groups of smokers (former smokers, light smokers and heavy smokers) has less variability in testosterone levels than nonsmokers.
The p-values of the Shapiro-Wilk normality tests for the four groups are 0.5540, 0.6516, 0.4525 and 0.2398, respectively, so the normality assumption is not rejected for these data. The four groups give sample variances of 0.0520, 0.0389, 0.0250 and 0.0075. Moreover, the MIM and Spurrier tests give the same p-value, 0.0117. Hence, we can reject the null hypothesis at significance level α = 0.05, i.e., at least one smoking group has less variability in testosterone levels than the nonsmoking group.
Unlike other methods, the MIM method has a significant advantage in that it can provide probabilistic summaries of the information in the data concerning the quantity of interest B. To be more informative, we plot the plausibility function pl(B), where B = {θ}, as a function of θ in Fig 5. By locating α on the vertical axis, the corresponding $1 - \alpha$ MIM confidence interval can be easily obtained. More importantly, each point in the MIM interval is individually sufficiently plausible.
Example 2. To demonstrate the flexibility of the MIM method, the second dataset, shown in Table 5 [3], contains blood count measurements on three groups of animals, one of which served as a control while the other two were treated with two drugs. Since the Shapiro-Wilk normality tests of the three groups give p-values of 0.4834, 0.6942 and 0.5483, the normality assumption is not rejected for this dataset. Moreover, the sample variances of the control, Drug A and Drug B groups are 0.8841, 0.8165 and 2.4240, respectively. The problem of interest is to test whether the variability in blood counts following treatment with Drug A and Drug B is smaller than that in the control. Because accidental losses left the three groups with unequal numbers of animals, existing methods cannot be applied to these data. As an alternative, from Fig 6, the plausibility function of the MIM gives a plausibility of 0.6804 in this situation. Hence, there is no significant evidence for rejecting the null hypothesis that the variability in blood counts following treatment with Drug A and Drug B is the same as that of the control.

Discussion
In applied statistics, a common and important problem is the comparison of the variances of experimental treatments with that of a standard or control treatment under the assumption that the measurements are independent and normally distributed. The existing test procedures are not well developed for testing the homogeneity of variances against a control group and require the sample sizes to be equal. In real data analysis, the requirement for equal sample sizes of experimental treatments may not be satisfied because of accidental losses or other unexpected circumstances. A more general test method is therefore needed, and it is crucial to construct an appropriate prior-free, frequency-calibrated testing method. In this paper, we propose a new test method based on the marginal inferential model framework. The proposed MIM method makes at least three contributions. First, unlike the general IM framework, the new MIM method utilizes partial information from the null hypothesis to construct accurate testing methods. This idea complements the precise theory of statistical inference. Second, the constructed association model in (26) is the key to accurate inference of the MIM. Since the distribution of the right-hand side of equation (25) does not rely on the observable data, we demonstrate that the MIM method has an accurate type I error rate without requiring simulation verification. Finally, the MIM test does not require the sample sizes among the treatment groups to be equal, while other tests do. Note that the MIM has an advantage in providing valid probabilistic uncertainty quantification. Unlike the p-value of Spurrier's test, the output of the MIM test, i.e., plausibility, is posterior-probabilistic in nature and therefore has a meaningful interpretation within and not just across experiments. Therefore, the plausibility function provides more information than the p-value of the frequentist approach, because even a large p-value cannot "confirm" the truth of the null hypothesis.
Because we focus on comparing the variances of normal distributions, the underlying distribution of the data needs to be normal. A potential challenge of the proposed procedure (as well as other methods based on parametric models) is that the true distribution of the data is unknown. In our real-data applications, we apply the Shapiro-Wilk normality test to check whether the data are normally distributed. Even though the p-values in the two real-data examples are greater than the specified significance level, the data may still be nonnormal, especially with a relatively small sample size. One possible way to alleviate this challenge is to increase the sample size of the data.
Unlike traditional IM-based test methods, the proposed MIM solution uses part of the information given in the null hypothesis to reduce the dimension of the auxiliary variable and gains more validity. This idea could be applied to other multiple comparison procedures for comparing several treatments with a control. For instance, the methodology can be extended to the comparison of the means of normal distributions, where the mean is of interest and the variance is a nuisance parameter. In this case, the marginal associations (e.g., Eqs (17) and (24)-(26)) would be equations concerning the normal mean. Comparison of the means is more complicated since different marginalization techniques might lead to different inferential results. Indeed, we are looking for the best marginalization technique for the comparison of the means, and some studies are still ongoing. Finally, for two-sided tests, since the auxiliary variable is two-dimensional, there could be interest in the simultaneous prediction of several auxiliary variables. The optimal predictive random set needs further study.
does not rely on the observable data $(S_0^2, S_1^2, \ldots, S_k^2)$. Let F denote the distribution function of the right side of equation (25); we can then construct a new MIM-based procedure to make inferences about assertion B = {θ = 1} as follows. A-Step: The final association model of the MIM is given by